On Precision and Recall of Multi-Attribute Data Extraction from Semistructured Sources

Authors

Guizhen Yang

Saikat Mukherjee

I. V. Ramakrishnan

Abstract

Machine learning techniques for data extraction from semistructured sources exhibit different precision and recall characteristics. However, to date, the formal relationship between learning algorithms and their impact on these two metrics remains unexplored. This paper proposes a formalization of precision and recall of extraction and, based on this formalism, investigates the complexity-theoretic aspects of learning algorithms for multi-attribute data extraction. We show that there is a tradeoff between precision/recall of extraction and computational efficiency, and present experimental results to demonstrate the practical utility of these concepts in designing scalable data extraction algorithms that improve recall without compromising precision.
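For readers unfamiliar with the two metrics, the following is a minimal sketch of the conventional set-based definitions of precision and recall, applied to extracted attribute values. It is not the formalization proposed in the paper, which is not reproduced on this page. The `(record id, attribute, value)` representation, the function name `precision_recall`, the sample data, and the choice to return 1.0 for an empty set are all assumptions made for illustration.

```python
from typing import Set, Tuple

# Hypothetical representation: one extracted fact per (record id, attribute, value).
Fact = Tuple[str, str, str]


def precision_recall(extracted: Set[Fact], gold: Set[Fact]) -> Tuple[float, float]:
    """Return (precision, recall) using the conventional set-based definitions.

    precision = |extracted AND gold| / |extracted|
    recall    = |extracted AND gold| / |gold|
    An empty denominator yields 1.0 here; that convention is an assumption.
    """
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 1.0
    recall = correct / len(gold) if gold else 1.0
    return precision, recall


# Illustrative data only: two records with two attributes each.
gold = {
    ("r1", "model", "X"),
    ("r1", "price", "100"),
    ("r2", "model", "Y"),
    ("r2", "price", "200"),
}
extracted = {
    ("r1", "model", "X"),
    ("r1", "price", "100"),
    ("r2", "model", "Z"),  # wrong value, so it counts against precision
}

p, r = precision_recall(extracted, gold)
print(f"precision={p:.3f} recall={r:.3f}")
```

In this toy case two of the three extracted facts are correct and two of the four gold facts are recovered, so an extractor can trade one metric against the other by extracting more or fewer candidate values.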


Similar resources

Evaluation of Updating Methods in Building Blocks Dataset

With the increasing use of spatial data in daily life, the production of this data from diverse information sources with different precision and scales has grown widely. Generating new data requires a great deal of time and money. Therefore, one way to reduce costs is to update the old data at different scales using new data (produced on a similar scale). One approach to updating data i...


AMBER: Automatic Supervision for Multi-Attribute Extraction

The extraction of multi-attribute objects from the deep web is the bridge between the unstructured web and structured data. Existing approaches either induce wrappers from a set of human-annotated pages or leverage repeated structures on the page without supervision. What the former lack in automation, the latter lack in accuracy. Thus accurate, automatic multi-attribute object extraction has r...


Product information extraction from semistructured documents using HMMs

In this paper we present preliminary results for information extraction (IE) performed over a set of HTML documents using Hidden Markov Models (HMMs). In our experiments, we restrict ourselves to the domain of bike products sold on the Internet. The information to be extracted consists of bike model attributes and details regarding the company’s offer. We experiment with a simple extension to H...


Query Architecture Expansion in Web Using Fuzzy Multi Domain Ontology

As the web grows, establishing a general framework for data mining and retrieving structured data from the Web poses many challenges. Creating an ontology is a step towards solving this problem. The ontology raises the main entity and the concept of any data in data mining. In this paper, we tried to propose a method for applying the "meaning" of the search system, but the problem ...


The Effect of Transitive Closure on the Calibration of Logistic Regression for Entity Resolution

This paper describes a series of experiments in using logistic regression machine learning as a method for entity resolution. From these experiments the authors concluded that when a supervised ML algorithm is trained to classify a pair of entity references as a linked or not-linked pair, the evaluation of the model’s performance should take into account the transitive closure of its pairwise lin...



Publication date: 2003
